In [5]:
# Import the pandas, pandas_profiling and seaborn libraries
import pandas as pd
import pandas_profiling
import seaborn as sns
In [6]:
# Import the “bi_product.txt” and the “bi_salesFact.txt” files
product=pd.read_table("/Users/Nicolas/Desktop/Masterstudium /Semester 3/Ljubljana/Big Data Management/Datasets/Python/bi_product.txt", sep=";")
In [7]:
fact=pd.read_table("/Users/Nicolas/Desktop/Masterstudium /Semester 3/Ljubljana/Big Data Management/Datasets/Python/bi_salesFact.txt", sep=";")
In [8]:
# Checking if both tables are successfully read
fact
Out[8]:
ProductID zip Units Revenue Datekey city state latitude longitude
0 706 75754 1 283.4475 20100101 Ben Wheeler TX 32.412451 -95.667110
1 1730 32317 1 786.9750 20100101 Tallahassee FL 30.479347 -84.346204
2 1789 76463 1 944.9475 20100101 Mingus TX 32.493233 -98.418050
3 1826 46158 1 220.4475 20100101 Mooresville IN 39.588545 -86.374310
4 2332 28128 1 534.9750 20100101 Norwood NC 35.228831 -80.152140
... ... ... ... ... ... ... ... ... ...
2035533 577 28307 1 1023.6975 20150421 Fort Bragg NC 35.142321 -79.012250
2035534 578 65052 1 787.4475 20150421 Linn Creek MO 38.058473 -92.685500
2035535 578 74019 1 787.4475 20150421 Claremore OK 36.280000 -95.490000
2035536 578 83429 1 787.4475 20150421 Island Park ID 44.490674 -111.366310
2035537 585 60467 1 419.9475 20150421 Orland Park IL 41.598255 -87.892010

2035538 rows × 9 columns

In [10]:
# Checking if both tables are successfully read
product
Out[10]:
ProductID Product Category Segment ManufacturerID
0 1 Abbas MA-01 Mix All Season 1
1 2 Abbas MA-02 Mix All Season 1
2 3 Abbas MA-03 Mix All Season 1
3 4 Abbas MA-04 Mix All Season 1
4 5 Abbas MA-05 Mix All Season 1
... ... ... ... ... ...
2407 2408 Aliqui YY-17 Youth Youth 2
2408 2409 Aliqui YY-18 Youth Youth 2
2409 2410 Aliqui YY-19 Youth Youth 2
2410 2411 Aliqui YY-20 Youth Youth 2
2411 2412 Aliqui YY-21 Youth Youth 2

2412 rows × 5 columns

In [11]:
# Merging the tables (product and fact) with pandas to create a table “total”
total= pd.merge(product, fact, on=["ProductID"])
In [70]:
# Checking if the tables were successfully merged
total
Out[70]:
ProductID Product Category Segment ManufacturerID zip Units Revenue Datekey city state latitude longitude
0 1 Abbas MA-01 Mix All Season 1 30116 1 412.1250 20121020 Carrollton GA 33.560454 -85.02254
1 1 Abbas MA-01 Mix All Season 1 90630 1 412.1250 20121010 Cypress CA 33.817481 -118.03990
2 1 Abbas MA-01 Mix All Season 1 95358 1 412.1250 20121011 Modesto CA 37.622898 -121.05181
3 1 Abbas MA-01 Mix All Season 1 78260 1 412.1250 20121013 San Antonio TX 29.707273 -98.47969
4 1 Abbas MA-01 Mix All Season 1 91343 1 412.1250 20121015 North Hills CA 34.239503 -118.48224
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2035533 2412 Aliqui YY-21 Youth Youth 2 89128 1 107.5725 20150529 Las Vegas NV 36.193501 -115.26500
2035534 2412 Aliqui YY-21 Youth Youth 2 22205 1 107.5725 20150529 Arlington VA 38.882750 -77.13999
2035535 2412 Aliqui YY-21 Youth Youth 2 28579 1 107.5725 20150529 Smyrna NC 34.753595 -76.51840
2035536 2412 Aliqui YY-21 Youth Youth 2 97031 1 99.6975 20150529 Hood River OR 45.673147 -121.53680
2035537 2412 Aliqui YY-21 Youth Youth 2 28584 1 107.5725 20150519 Swansboro NC 34.697710 -77.10614

2035538 rows × 13 columns

In [71]:
# Getting an overview of the data with profile_report()
total.profile_report()
Out[71]:

By using profile_report, specific information about the data is displayed: for example, that there are 13 attributes and 2,035,538 observations. Of these 13 attributes, 8 are numerical and 5 are categorical. It is also visible that some values are missing. After this basic overview, detailed information about each attribute is displayed. For example, the Category variable shows that most of the sold products belong to the attribute value "Urban". The city that receives the most products is Miami, whereas the state to which the most products are shipped is California. After the detailed information for every attribute, the correlations between the attributes are displayed. Because some of the displayed values are numeric in datatype but have to be interpreted as nominal (categorical), the direction of the shown correlations is hard to interpret. For example, a negative correlation is displayed between Revenue and ProductID. But since the product IDs are numeric in type yet nominal in meaning, one has to be careful with the interpretation. In general, there seem to be slight correlations between zip codes and longitude, Revenue and ProductID, as well as Revenue and Units.

This observation of the data structure helps to determine which further steps are necessary or possible for a valid data analysis. For example, correlations can motivate further investigation, and handling missing values can be crucial to increase the data quality.
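The pairwise correlations that the profile report shows can also be computed directly. A minimal sketch on a small toy table (the numbers are made up, not from the dataset) illustrates why an ID column's correlations should be treated with care:

```python
import pandas as pd

# Toy sales table with hypothetical values, just to illustrate the call
df = pd.DataFrame({
    "ProductID": [1, 2, 3, 4, 5],
    "Units":     [1, 1, 2, 1, 3],
    "Revenue":   [412.13, 786.98, 840.00, 220.45, 1260.00],
})

# Pearson correlations between the numeric columns; ProductID is only a
# label, so its row and column should not be over-interpreted
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

On the real dataset the same call on `total` reproduces the correlation matrix that the profile report visualizes.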

Since the overview showed that there are missing values in the dataset, we have to think about how to handle them. First, we have to figure out how many values are missing and in which attributes.

In [12]:
print(total.isna().sum())
ProductID           0
Product             0
Category            0
Segment             0
ManufacturerID      0
zip                 0
Units               0
Revenue           280
Datekey             0
city                0
state               0
latitude            0
longitude           0
dtype: int64

After this step, one knows that 280 values are missing, all in the attribute "Revenue". Since the dataset has 2,035,538 observations, the simplest option is to exclude these observations to increase the data quality.
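Dropping is not the only option: one could also impute the missing revenues, for example with the median. A small sketch on a toy frame (hypothetical numbers) contrasts the two approaches:

```python
import pandas as pd

# Toy frame with one missing Revenue value (hypothetical numbers)
df = pd.DataFrame({"Revenue": [412.13, None, 786.98, 220.45]})

# Option 1: drop the incomplete rows (the approach taken below)
dropped = df.dropna()

# Option 2: impute with the column median instead, keeping all rows
imputed = df.fillna({"Revenue": df["Revenue"].median()})

print(len(dropped), len(imputed))
```

With only 280 of over two million rows affected, dropping loses almost no information, which is why it is the approach used here.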

In [13]:
# Excluding the N/As of the dataset
total_c=total.dropna()
In [74]:
# After this procedure, the dataset now has 2,035,258 observations
total_c
Out[74]:
ProductID Product Category Segment ManufacturerID zip Units Revenue Datekey city state latitude longitude
0 1 Abbas MA-01 Mix All Season 1 30116 1 412.1250 20121020 Carrollton GA 33.560454 -85.02254
1 1 Abbas MA-01 Mix All Season 1 90630 1 412.1250 20121010 Cypress CA 33.817481 -118.03990
2 1 Abbas MA-01 Mix All Season 1 95358 1 412.1250 20121011 Modesto CA 37.622898 -121.05181
3 1 Abbas MA-01 Mix All Season 1 78260 1 412.1250 20121013 San Antonio TX 29.707273 -98.47969
4 1 Abbas MA-01 Mix All Season 1 91343 1 412.1250 20121015 North Hills CA 34.239503 -118.48224
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2035533 2412 Aliqui YY-21 Youth Youth 2 89128 1 107.5725 20150529 Las Vegas NV 36.193501 -115.26500
2035534 2412 Aliqui YY-21 Youth Youth 2 22205 1 107.5725 20150529 Arlington VA 38.882750 -77.13999
2035535 2412 Aliqui YY-21 Youth Youth 2 28579 1 107.5725 20150529 Smyrna NC 34.753595 -76.51840
2035536 2412 Aliqui YY-21 Youth Youth 2 97031 1 99.6975 20150529 Hood River OR 45.673147 -121.53680
2035537 2412 Aliqui YY-21 Youth Youth 2 28584 1 107.5725 20150519 Swansboro NC 34.697710 -77.10614

2035258 rows × 13 columns

In [80]:
# Grouping the data by Category
total_grouped=total_c.groupby(["Category"], as_index=False)
In [81]:
# Displaying the sum of the sold units and the sum of the revenue for every category
total_grouped[["Units", "Revenue"]].sum()
Out[81]:
Category Units Revenue
0 Mix 124829 5.374700e+07
1 Rural 521534 1.313479e+08
2 Urban 1310034 1.047594e+09
3 Youth 145065 2.738623e+07
In [77]:
# Getting the mean of the revenue
total_grouped["Revenue"].mean()
Out[77]:
Category Revenue
0 Mix 443.988240
1 Rural 265.772350
2 Urban 817.238625
3 Youth 198.277063
In [78]:
# Getting the median of the revenue
total_grouped["Revenue"].median()
Out[78]:
Category Revenue
0 Mix 362.1975
1 Rural 209.4750
2 Urban 782.1975
3 Youth 157.4475

By analyzing the previous tables, it is visible that for each of the four categories the mean revenue is higher than the median revenue. Therefore, a positive skewness is present, meaning that the right tail is longer and the mass of the distribution lies on the left side. In terms of content, this means that the majority of products were sold below the mean price: a few very expensive products stand against many cheaper goods. The large number of outliers influences the distribution, which shows in the fact that the mean of each category is higher than its median. To analyze this further, one can use the max and the mode operators to find the highest and the most frequent values and determine their distance.
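The mean-above-median rule of thumb can be backed up with pandas' skewness coefficient. A minimal sketch on a hypothetical sample (many low values plus a few large ones, mimicking the revenue shape):

```python
import pandas as pd

# Hypothetical revenues: many low values plus one very large one
s = pd.Series([100, 120, 130, 150, 160, 200, 5000.0])

# mean > median and a positive skew coefficient both indicate a right tail
print(s.mean() > s.median())
print(s.skew())
```

On the real data, `total_c.groupby("Category")["Revenue"].skew()` would give this coefficient per category.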

In [21]:
# Calculating the mode (most frequent value)
total_c["Revenue"].mode()
Out[21]:
0    419.9475
dtype: float64
In [22]:
# Calculating the max value 
total_c["Revenue"].max()
Out[22]:
109719.225

The two previous steps show that the most common value is 419.95, which is very far from the highest value, 109,719.23. This very big difference shows that the outliers heavily influence the mean, and it is another indicator for the skewness of the distribution.
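Quantiles quantify this gap more precisely than mode and max alone. A sketch on a hypothetical revenue sample with one extreme outlier:

```python
import pandas as pd

# Hypothetical revenue sample with one extreme outlier
rev = pd.Series([200, 300, 420, 420, 420, 500, 800, 109719.23])

# compare the bulk of the distribution to the extremes
print(rev.quantile([0.5, 0.95, 1.0]))
print(rev.max() - rev.mode().iloc[0])  # distance between max and mode
```

Applied to `total_c["Revenue"]`, the same quantile call would show how far the top percentiles sit above the median.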

We want to create graphs that visually display the connection between the categories and the revenue. Since we have one categorical variable (Category) and one numerical variable (Revenue), not every type of graph is suitable for analyzing the data.

Therefore we first produce only one scatterplot and one boxplot for the descriptive analysis.

In [54]:
# To be able to modify the titles of the graphs, we have to import matplotlib
import matplotlib.pyplot as plt
In [68]:
graph1=sns.scatterplot(x="Segment", y="Revenue", hue="Category", data=total_c);
plt.title("Scatterplot \n Product revenues by segment \n", weight="bold");
plt.show()
In [67]:
graph2=sns.boxplot(x="Segment", y="Revenue", hue="Category", data=total_c);
plt.title("Boxplot \n Product revenues by segment \n", weight="bold");
plt.show()

By creating a scatterplot with Revenue on the y-axis, Segment on the x-axis and Category as hue, one can first see from the colors which segments belong to which category. The "Mix" category contains the segment "All Season". The "Rural" category contains "Productivity" and "Select". The "Youth" category consists only of the "Youth" segment. Lastly, the biggest revenue contributor, "Urban", also consists of the most segments: "Moderation", "Regular", "Extreme" and "Convenience".

This plot furthermore shows the revenue of each order placed within each segment, giving an overview of the order structure. Besides the assignment of the segments, one can also see the skewness of the distributions visually. The categories "Mix", "Urban" and "Youth" all have outliers that are distant from the mass of the values. Especially within the segment "Moderation", an exceptionally big outlier is detectable; this is the maximum value displayed in one of the previous steps. One can also see that the products of the "Rural" category have the smallest spread, whereas "Urban" has the biggest.

By creating a boxplot, we gain the additional information where the median of each segment lies and can compare it to the distribution. With this, it becomes even more visible that there are many products with only small revenues: almost every segment has a median below 1,000, but a certain number of products are sold for much more. This is again an indicator for the skewness of the data.
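The median lines that the boxplot draws can also be read off numerically with a groupby. A sketch on toy data mimicking the Segment/Revenue structure (the values are made up):

```python
import pandas as pd

# Toy data mimicking the Segment/Revenue structure (values are made up)
df = pd.DataFrame({
    "Segment": ["Moderation", "Moderation", "Regular", "Regular", "Youth"],
    "Revenue": [800.0, 900.0, 400.0, 600.0, 150.0],
})

# per-segment medians, the same quantity the boxplot's center lines show
medians = df.groupby("Segment")["Revenue"].median()
print(medians)
```

Running `total_c.groupby("Segment")["Revenue"].median()` on the real data would confirm the "median below 1,000" observation exactly.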

If we want a more aggregated view, we have to group the data again and then apply methods that calculate the average or the sum for each segment.

In [73]:
# Average revenue by Segment
graph3=sns.catplot(x="Segment", y="Revenue", hue="Category", kind="bar", data=total_c);
plt.title("Barplot \n Average revenues by segment \n", weight="bold");
plt.show()
In [74]:
# Total revenue by segment
total_grouped2=total_c.groupby(["Category", "Segment"] , as_index=False)
total_rev=total_grouped2["Revenue"].sum()
graph4=sns.catplot(x="Segment", y="Revenue", hue="Category", kind="bar", data=total_rev);
plt.title("Barplot \n Total revenues by segment \n", weight="bold");
plt.show()

With these two plots it is even more obvious that the Urban category is the most profitable one for the company. Especially the total-revenue plot gives an aggregated view of the data and shows the relation between the product categories.
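The aggregation behind the total-revenue bar chart can also be laid out as a Category-by-Segment table with a pivot table. A sketch on a toy frame (the numbers are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame with the Category/Segment hierarchy (numbers are illustrative)
df = pd.DataFrame({
    "Category": ["Urban", "Urban", "Rural", "Mix"],
    "Segment":  ["Moderation", "Regular", "Select", "All Season"],
    "Revenue":  [1000.0, 500.0, 300.0, 400.0],
})

# the same aggregation as the bar chart, rendered as a table; cells for
# segment/category pairs that do not occur stay NaN
pivot = df.pivot_table(index="Category", columns="Segment",
                       values="Revenue", aggfunc="sum")
print(pivot)
```

The NaN cells make explicit that each segment belongs to exactly one category, which the hue coloring in the plots only shows implicitly.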

To gain further insights, we can take an isolated look at the most diverse segment, "Moderation".

In [76]:
filter1=total_c["Segment"].isin(["Moderation"])
Moderation=total_c[filter1]
In [77]:
M_graph=pd.Series(Moderation["Revenue"], name="Revenue")
graph5=sns.kdeplot(M_graph);  # kdeplot replaces the deprecated distplot(..., hist=False)
plt.title("Revenue distribution for Moderation \n", weight="bold");
plt.show()

By analyzing the distribution of the "Moderation" segment, it is also possible to detect that there are many low-revenue values, then a very large gap, and finally some revenues above 100,000. This emphasizes the large range of values within the "Moderation" segment, which calls for special caution in the further analysis.
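The gap the density plot shows can be summarized without a plot via `describe()`. A sketch on a hypothetical "Moderation"-like sample (a dense low cluster plus one extreme value):

```python
import pandas as pd

# Hypothetical "Moderation" revenues: a dense low cluster plus one extreme
mod = pd.Series([300, 350, 400, 420, 450, 109719.23], name="Revenue")

# summary statistics make the gap between bulk and maximum visible
stats = mod.describe()
print(stats[["min", "50%", "max"]])
```

On the real data, `Moderation["Revenue"].describe()` would show the same pattern: a median close to the low cluster and a maximum orders of magnitude above it.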

With these descriptive graphs we can show how the data is distributed, which is important for the further steps of the inductive data analysis. The product category "Urban" is especially interesting because it seems to be the most profitable but also has the biggest range of all categories. Knowing about these ranges and outliers, one can prepare the data properly and thereby obtain more valuable insights. Since we can now state certain characteristics of the data, such as skewness, we can formulate hypotheses about connections between the values and test them afterwards.

In [ ]: